Global statistical analysis of the protein homology network 






C. Miccio

Dipartimento di Fisica G. Occhialini, Università di Milano-Bicocca and INFN,
Sezione di Milano, Piazza della Scienza 3 - I-20126 Milano, Italy

T. Rattei

Department of Genome Oriented Bioinformatics, Technical University of Munich,
Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany
(Dated: February 6, 2008) 

The similarity between protein sequences is a directly and easily computed quantity from which to
deduce information about their evolutionary distance and to detect homologous proteins. The SIMAP
database (Similarity Matrix of Proteins) provides a pre-computed similarity matrix covering the
similarity space formed by nearly all publicly available amino acid sequences from public databases
and completely sequenced genomes. From SIMAP we construct the protein homology network, where
the proteins are the nodes and the links represent homology relationships. With more than 5 million
nodes and about $70 \times 10^9$ edges, it is the largest protein homology network built so far. We
describe its basic features and perform a global statistical analysis of the network. Starting from the
Smith-Waterman similarity score, we define for each edge a weight $w$ that measures the similarity
between two nodes. Keeping only the edges with a weight greater than a minimal value $\bar w$, and
varying $\bar w$, we build a family of networks with different degrees of similarity. We investigate the
distribution of connected components (clusters) of the networks at different $\bar w$ and, in particular,
we find a behaviour similar to a phase transition driven by the formation of a giant component.
Moreover, we study selected sequence features and protein domains of the protein pairs that connect
different clusters in the networks at different levels of similarity. We observe specific, non-random
distributions of the protein features and domains for proteins connecting clusters at certain weight
intervals.

I. BACKGROUND

The number of known proteins is rapidly growing and the sequence of amino acids is, at the moment, the main source of information for many new proteins which still have unidentified functions. Protein sequence analysis, and more specifically the analysis of similarities among protein sequences, is therefore the basis of studies trying to understand protein evolutionary processes or to detect unknown biological functions of new proteins. Proteins with similar sequences can be found in different organisms and in a single organism [1, 2]. By means of the degree of similarity obtained by a pairwise sequence comparison it is possible to deduce information about their evolutionary distance. Specifically, two proteins are homologous if they evolved from a common ancestral protein sequence and, in most cases, they also have the same, or a very similar, biological function. Homology can be deduced from statistically significant sequence similarities. However, new sequences often have only weak similarities to known proteins, and single similarity searches are insufficient to transfer validated properties of characterized proteins to new sequences. Instead, a graph formed by all-against-all comparisons of a large amount of protein data can be useful. This is the case of SIMAP (Similarity Matrix of Proteins), a database containing the similarity space formed by almost all amino acid sequences, with nearly 5.5 million non-redundant protein sequences drawn from completely sequenced genomes and public databases. Moreover, the pre-calculated similarity space allows very rapid access to significant hits of interest and avoids time-consuming re-computation. The algorithm that pre-computes the sequence similarities is based on the FASTA heuristic. First it compares low-complexity masked proteins using FASTA, and then it recalculates the hits found using non-masked sequences and the Smith-Waterman algorithm. In both phases of the alignment process the BLOSUM50 amino acid substitution matrix is used. For each hit, the Smith-Waterman score, the identity, the gapped identity, the overlap, and the start and stop coordinates of the alignment in both proteins are stored. For more details see [2].

Graphs formed by all-against-all sequence comparisons 
can be used to derive inheritance patterns of proteins, to 
reconstruct the evolutionary relationships between pro- 
teins and to classify them into protein families by look- 
ing for dense clusters disconnected from the rest of the 
network. To date, this approach has been carefully eval- 
uated by case studies targeted at selected protein fami- 
lies yD, but a global analysis of the complete homology 
network formed by all publicly available proteins has not 
been published. The aim of this work is to analyze global 
and local properties of the graph forming the homology 
network. 

II. SIMAP GRAPH REPRESENTATION

The information contained in the SIMAP database can be reorganized by means of a weighted graph representation, $G(V, E, w)$, where $V$ is the set of nodes, $E$ the set of edges, and $w$ a weight function on the edges, $w : E \to [0, 1]$. Each node $a \in V$ represents a protein sequence and each edge $e = \{a, b\} \in E$ between two nodes $a, b$ represents the stored alignment between the respective protein sequences [3]. In this way an undirected weighted graph is obtained, since the symmetry of the alignment procedure leads to undirected edges and the score of the alignment allows the assignment of a suitable weight to every edge. (Although a protein sequence can be aligned with itself, self-edges are not considered.) More specifically, if $s(a, b)$ is the Smith-Waterman (SW) optimal score obtained with the FASTA algorithm between sequences $a$ and $b$, a suitable weight $w(a, b) \in [0, 1]$ for the edge $e = \{a, b\}$ can be defined as follows:

\[ w(a, b) = \frac{s(a, b)}{\sqrt{s(a, a)\, s(b, b)}} . \tag{1} \]

From $w(a, b)$ one could define a distance function as $d(a, b) = 1 - w(a, b)$, whose values are in $[0, 1]$. Like the distance functions usually defined on linear spaces, $d$ should satisfy the positivity, nullity and symmetry properties for all pairs of protein sequences, and also the triangle inequality, which is fully satisfied for the BLOSUM50 matrix.

III. POLISHING PROCEDURE

Strictly speaking, the set of all protein sequences of the SIMAP database is not a good space over which to define the distance measure $d$. There are, in fact, 1538 pairs of sequences that have distance equal to zero, although they are classified with different sequence ids. They differ only in the presence of one or two 'X' in their amino acid sequence annotation, where 'X' is the standard symbol for an unknown amino acid residue in a protein sequence. It is therefore natural to remove, for each of these pairs of sequences, the one that has the 'X' in the sequence; in the graph representation, this procedure entails the removal of all edges connected to the removed nodes. Another improvement for database consistency is the check of the symmetry of all edges: every time a directed edge is found, the inverse relation, if absent, is added.

As a final result of these manipulations, a graph with $V = 5,489,907$ nodes and $E = 69,500,722,050$ edges can be constructed.

Over the polished SIMAP protein sequence space the distance $d = 1 - w(a, b)$ fails the triangle inequality in a few cases (around 0.2% of the triangles). However, redefining the distance as, for instance,

\[ d(a, b) = \sqrt{1 - w(a, b)} , \tag{2} \]

the triangle inequality is satisfied for all triples of linked proteins and $d$ has all the properties required for a distance measure.

IV. CHARACTERIZATION OF SIMAP PROTEIN SPACE

In the SIMAP database, protein sequences come from 104,560 different species. There are, in particular, 3 species (Homo sapiens, Arabidopsis thaliana and rice) with more than 100,000 protein sequences and 72 with more than 10,000.

kingdoms                                 number of species
bacteria                                     11,130
viruses                 viruses              13,708
                        phages                  923
plants                                       31,232
animalia                invertebrates        25,951
                        vertebrates          19,341
                        (rodents)            (1,474)
                        (mammals)            (1,854)
                        (primates)             (393)
environmental samples                         1,453
synthetic                                       822

TABLE I: Number of species for each kingdom.


A coarse subdivision of all species is shown in Table I; it separates the species into five (non-standard) main kingdoms: bacteria, viruses, plants, invertebrates (animalia) and vertebrates (animalia). The classification reveals the presence of very many different animalia species, but only eight of these species are present with their complete genome (the other animalia proteins were imported from multiple-species databases). Figure 1 shows the protein distribution for each kingdom. There is also a high number (546,439) of unassigned protein sequences.



A. Length and self-similarity distribution 

The protein sequence space is characterized by the length distribution shown in Figure 2(a); Figure 2(b) gives the length distributions for sequences belonging to bacteria, viruses, plants, vertebrates and invertebrates.

The distribution of the self-similarity scores of the protein sequences appears in Figure 3. It is well reproduced by a mixture of normal distributions, one for each sequence length. The self-similarity score $s(a, a)$ of a protein sequence of length $l$ can be thought of as a sum of $l$ i.i.d. random variables, i.e. a sum of the self-similarity scores of random amino acids. Knowing the amino acid background probabilities $p_a$ and the diagonal values $B_{aa}$ of the BLOSUM50 score matrix, the self-similarity score of a random amino acid has mean $\langle s \rangle = \sum_a p_a B_{aa}$ ($\approx 6.727$) and standard deviation $\sigma = \sqrt{\sum_a p_a B_{aa}^2 - \langle s \rangle^2}$ ($\approx 2.067$). The self-similarity scores of random amino acid sequences of length $l$ then follow approximately a normal distribution $g(l, s)$ with mean $l \langle s \rangle$ and standard deviation $\sqrt{l}\, \sigma$. Finally, the self-similarity score distribution is well approximated by the sum $\sum_l g(l, s)\, f(l)$, where $f(l)$ is the observed length distribution (Figure 4).

FIG. 1: Distribution of proteins for each kingdom. The inset shows the distribution within vertebrates.

FIG. 2: (a) Distribution of protein sequence lengths. The inset shows an enlargement of the distribution. (b) Length distributions of protein sequences which belong to bacteria ($\langle l \rangle = 316.9$, $l_{max} = 36805$), viruses ($\langle l \rangle = 273.9$, $l_{max} = 7312$), plants ($\langle l \rangle = 314.5$, $l_{max} = 20925$), invertebrates ($\langle l \rangle = 416.1$, $l_{max} = 23015$) and vertebrates ($\langle l \rangle = 397.1$, $l_{max} = 38031$).

FIG. 3: Distribution of protein sequence self-scores. The inset shows an enlargement of the distribution.

FIG. 4: Comparison between the distribution of protein sequence self-scores and the curve obtained by a superposition of normal distributions, suitably weighted by the length distribution of the protein sequences.
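
The mixture-of-normals estimate of the self-score distribution can be sketched as follows. This is only an illustration under stated assumptions: the three-letter alphabet, its probabilities and its diagonal scores are made-up toy inputs (not the real background frequencies or the full BLOSUM50 diagonal), and the function names are hypothetical.

```python
import math

def residue_score_moments(p, diag):
    """Mean and standard deviation of the self-score of one random residue.

    p    : dict residue -> background probability p_a (should sum to 1)
    diag : dict residue -> diagonal substitution score B_aa
    """
    mean = sum(p[a] * diag[a] for a in p)
    var = sum(p[a] * diag[a] ** 2 for a in p) - mean ** 2
    return mean, math.sqrt(var)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def self_score_density(s, length_freq, mean, sigma):
    """Mixture sum_l f(l) * g(l, s), where g(l, .) is normal with
    mean l * mean and standard deviation sqrt(l) * sigma."""
    return sum(f * normal_pdf(s, l * mean, math.sqrt(l) * sigma)
               for l, f in length_freq.items())

# Toy inputs (assumptions for illustration only).
toy_p = {"A": 0.5, "W": 0.25, "C": 0.25}
toy_diag = {"A": 5, "W": 15, "C": 13}
mean, sigma = residue_score_moments(toy_p, toy_diag)

# Hypothetical length distribution f(l): half the sequences have length 100,
# half have length 300.
toy_lengths = {100: 0.5, 300: 0.5}
density_at_1000 = self_score_density(1000.0, toy_lengths, mean, sigma)
```

With the real background probabilities and the 20 diagonal entries of the substitution matrix as inputs, the same two functions would give the per-residue mean and standard deviation quoted in the text.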



B. Pairwise similarity distribution 

The distribution of the SW optimal similarity scores obtained from all FASTA sequence alignments has a uniform cutoff at a score of 80, which is the threshold used for storing hits in the SIMAP database. The cutoff was chosen independently of the query and database lengths, as a compromise between sensitivity and the need to store a manageable number of hits, given the high number of protein sequences.

Figure 5(a) shows the distribution of the weights $w$, and Figure 5(b) the corresponding repartition function $p(w)$. The values $p(w) \in [0, 1]$ represent the fraction of edges that have weight greater than or equal to $w$. From them we see that the major part of the edges (about 80% of the total number of edges) has a very low weight ($w < 0.2$).

FIG. 5: (a) Distribution of the edge weights $w$. The inset shows an enlargement of the tail of the distribution. (b) Repartition function of the edge weight distribution.
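
As a small illustration of the quantities discussed in this subsection, the sketch below computes the normalized weight $w(a,b) = s(a,b)/\sqrt{s(a,a)\,s(b,b)}$, the distance $d = \sqrt{1 - w}$ and the repartition function $p(w)$ for a handful of edges. The score triples are hypothetical toy values and the function names are chosen here for illustration; nothing in it reflects the actual SIMAP storage format.

```python
import math

def edge_weight(s_ab, s_aa, s_bb):
    """Normalized similarity w(a,b) = s(a,b) / sqrt(s(a,a) * s(b,b))."""
    return s_ab / math.sqrt(s_aa * s_bb)

def distance(w):
    """Distance d = sqrt(1 - w); the max() guards against rounding below zero."""
    return math.sqrt(max(0.0, 1.0 - w))

def repartition(weights, w_bar):
    """Fraction of edges whose weight is greater than or equal to w_bar."""
    return sum(1 for w in weights if w >= w_bar) / len(weights)

# Hypothetical toy hits: (s(a,b), s(a,a), s(b,b)) for four edges.
toy_hits = [(80, 400, 400), (200, 400, 1600), (300, 400, 400), (400, 400, 400)]
toy_weights = [edge_weight(*hit) for hit in toy_hits]
toy_distances = [distance(w) for w in toy_weights]

# Fraction of the toy edges with weight >= 0.5.
fraction_strong = repartition(toy_weights, 0.5)
```

Note that a fixed score cutoff such as 80 does not translate into a fixed weight cutoff, because the normalization depends on the self-scores of both sequences.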



C. Coordination and cluster distribution 

Weights $w$ can be used as a parameter to define a collection of graphs. For a fixed value $w = \bar w$ (or, equivalently, a distance $d = \bar d = \sqrt{1 - \bar w}$), a graph is built keeping only the edges with $w > \bar w$ ($d < \bar d$). For high values of $\bar w$, i.e. at small distances, nodes are linked if, and only if, the corresponding protein sequences have a high degree of similarity; it is then reasonable to expect graphs with many small connected components. By decreasing $\bar w$, in other words by also linking proteins with a lower degree of similarity, graphs with larger connected components are expected. The graph obtained by considering all stored edges (fixing $\bar w = 0$) is not the complete graph, due to the cutoff on the alignment score: it contains only about 0.1% of the edges of the corresponding complete graph.

We have built graphs for the 35 values of $\bar w$ from 0.975 down to 0.125, in steps of 0.025. For each of these values the set of protein sequences splits into clusters, i.e. isolated connected components. As proteins at greater and greater distance from each other are linked (decreasing $\bar w$), clusters merge to form larger clusters: the number of isolated proteins and the number of components of very small size decrease, while the number of clusters of medium and large size increases.
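
The construction of the thresholded graphs and of their clusters can be sketched with a union-find structure, which never needs the graph in adjacency form. This is a minimal illustration on a hypothetical six-node toy graph; the function names and the edge format `(a, b, w)` are assumptions made here, not the procedure actually used on the full network.

```python
from collections import Counter

def connected_components(n_nodes, edges, w_bar):
    """Label each node with the root of its component, keeping only
    the edges whose weight is strictly greater than w_bar."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, w in edges:
        if w > w_bar:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
    return [find(x) for x in range(n_nodes)]

def size_distribution(labels):
    """Map each component size s to the number n(s) of components of that size."""
    sizes = Counter(labels)               # root -> component size
    return dict(Counter(sizes.values()))  # size -> number of components

# Hypothetical toy graph: 6 nodes, 3 weighted edges (node 5 is isolated).
toy_edges = [(0, 1, 0.9), (1, 2, 0.6), (3, 4, 0.3)]
dist_high = size_distribution(connected_components(6, toy_edges, 0.8))
dist_mid = size_distribution(connected_components(6, toy_edges, 0.5))
dist_low = size_distribution(connected_components(6, toy_edges, 0.2))
```

On the toy graph, lowering the threshold merges small clusters into larger ones, which is the qualitative behaviour described above.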

Measuring the (not normalized) cluster distribution, we find that, for each fixed value of $\bar w$, the number of clusters $n_{\bar w}(s)$ of size $s$ follows, in a specific size range, a power law behaviour, $n_{\bar w}(s) \sim s^{-\alpha(\bar w)}$. The fitted values of $\alpha(\bar w)$ and the fitting size ranges are reported in Table II, and a log-log plot of the size distribution $n_{\bar w}(s)$ for three different values of $\bar w$ is shown in Figure 6(a). The (not normalized) coordination degree distribution $f_{\bar w}(z)$ also follows a power law, $f_{\bar w}(z) \sim z^{-\alpha(\bar w)}$, for each value of $\bar w$. A log-log plot of the coordination degree distribution $f_{\bar w}(z)$ for three different values of $\bar w$ is shown in Figure 6(b). The fitted values of $\alpha(\bar w)$ and the fitting ranges of the coordination degree are reported in Table III.



  w      alpha    component size range    correlation coefficient
 0.95    2.70         10 - 60                  -0.995
 0.90    2.70         10 - 60                  -0.996
 0.85    2.69         10 - 60                  -0.994
 0.80    2.62         10 - 80                  -0.996
 0.75    2.52         10 - 80                  -0.996
 0.70    2.40         10 - 80                  -0.996
 0.65    2.32         10 - 100                 -0.997
 0.60    2.21         10 - 100                 -0.996
 0.55    2.17         10 - 100                 -0.996
 0.50    2.07         10 - 100                 -0.997
 0.45    2.01         10 - 100                 -0.997
 0.40    2.00         10 - 100                 -0.996
 0.35    1.98         10 - 100                 -0.997
 0.30    1.98         10 - 100                 -0.997
 0.25    2.01         10 - 100                 -0.996

TABLE II: Fitted values of the exponent $\alpha$ of the power law distribution of connected components for selected values of $\bar w$. For each fit the size range and the correlation coefficient are reported.
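
A power law exponent of this kind, together with its correlation coefficient, can be estimated by a straight-line fit in log-log coordinates restricted to a chosen size range. The sketch below assumes an ordinary least-squares fit (the exact fitting procedure is not detailed in the text) and uses synthetic counts generated from an exact power law, so the data and the function name are illustrative only.

```python
import math

def fit_power_law(counts, s_min, s_max):
    """Least-squares line through the points (log s, log n(s)) with
    s_min <= s <= s_max. Returns (alpha, r), where n(s) ~ s**(-alpha)
    and r is the Pearson correlation coefficient of the log-log points."""
    pts = [(math.log(s), math.log(n)) for s, n in counts.items()
           if s_min <= s <= s_max and n > 0]
    m = len(pts)
    mean_x = sum(x for x, _ in pts) / m
    mean_y = sum(y for _, y in pts) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    return -slope, r

# Synthetic data (assumption): an exact power law with exponent 2.5.
toy_counts = {s: 1.0e6 * s ** -2.5 for s in range(10, 61)}
alpha, r = fit_power_law(toy_counts, 10, 60)
```

Because the fitted line has negative slope, the correlation coefficient of a good fit is close to -1, which matches the sign of the coefficients listed in the table.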

[Figure 6. Panel (a): size distribution of connected components, horizontal axis log s. Panel (b): coordination degree distribution, horizontal axis log z.]

FIG. 6: (a) Distribution of the size of the connected components of the protein sequence graph built at $\bar w = 0.975$ (red curve), $\bar w = 0.75$ (pink curve) and $\bar w = 0.4$ (blue curve). As the value of $\bar w$ decreases, the number of connected components of small size decreases and the starting region of the power law behaviour shifts to higher values of the size. (b) Distribution of the coordination degree of the protein sequence graph built at $\bar w = 0.975$ (red curve), $\bar w = 0.75$ (pink curve) and $\bar w = 0.4$ (blue curve). As the value of $\bar w$ decreases, the number of nodes with small coordination degree decreases and the starting region of the power law behaviour shifts to higher values of the coordination degree.



                             ------- first fit -------     -------- second fit --------
  w      <z>      max z     alpha    range      corr.      alpha    range        corr.
 0.95     14.4     5735     1.59    25 - 100   -0.990      1.46    100 - 500    -0.953
 0.90     73.1    10794     1.58    25 - 100   -0.988      1.51    100 - 500    -0.939
 0.85    138.3    16500     1.68    25 - 100   -0.993      1.42    100 - 800    -0.964
 0.80    207.2    23726     1.73    25 - 100   -0.994      1.29    100 - 800    -0.941
 0.75    294.0    33265     1.79    25 - 100   -0.997      1.22    100 - 1000   -0.956
 0.70    395.3    35202     1.74    25 - 100   -0.996      1.28    100 - 1000   -0.946
 0.65    507.8    36333     1.71    25 - 100   -0.998      1.39    100 - 1000   -0.950
 0.60    622.3    37729     1.63    25 - 100   -0.999      1.32    100 - 1500   -0.930
 0.55    745.3    41871     1.54    25 - 100   -0.998      1.44    100 - 1500   -0.927
 0.50    911.7    49895     1.44    25 - 100   -0.998      1.56    100 - 2000   -0.944
 0.45   1108.1    51309     1.38    25 - 100   -0.998      1.62    100 - 2000   -0.951
 0.40   1314.2    51956     1.28    25 - 100   -0.998      1.67    100 - 2500   -0.946
 0.35   1501.9    52513     1.19    25 - 100   -0.998      1.72    100 - 2500   -0.961
 0.30   1668.9    60722     1.08    25 - 100   -0.997      1.74    100 - 3000   -0.969
 0.25   1826.2    64781     0.97    25 - 100   -0.997      1.78    100 - 3000   -0.969

TABLE III: Fitted values of the exponent $\alpha$ of the power law distribution of the coordination degree for selected values of $\bar w$. We compute two linear fits that differ in the choice of the fitting range of the coordination degree. For each fit the range of the coordination degree and the correlation coefficient are reported. The second column shows the average degree; the third column gives the maximum value of the coordination degree.

V. COMPARISON WITH GENERALIZED RANDOM GRAPHS

It is interesting to compare these behaviours with those of a model of random graphs. It is well known that, in the classical model introduced by Erdős and Rényi [4], where every pair of nodes is chosen to be an edge with probability $p$, all nodes have the same expected coordination degree, so these graphs are characterized by a Poissonian coordination degree distribution with mean value $\langle z \rangle \approx pV$. Furthermore, as soon as $\langle z \rangle$ assumes a value greater than 1, a giant connected component appears, that is, a component whose size is much greater than the size of all other components and that contains a significant fraction of all the nodes of the graph.

A better theoretical comparison model is given by generalized random graphs endowed with a specific degree distribution. These can be generated via a Monte Carlo algorithm (following the work of Burda et al. [5]). In particular, starting from a random graph of $V$ nodes and $E$ edges, we make local graph transformations that leave the number of nodes and the number of edges constant, and accept them with a probability that depends on the desired equilibrium degree distribution (Metropolis algorithm). In this way we have generated a collection of random graphs with the same coordination degree distribution and the same average degree as some of our protein sequence graphs.

For each of them we observe a fundamentally different distribution of connected components in the protein sequence graphs and in the random graphs. In the latter model the power law behaviour is absent, while there is always a dominant giant connected component, much larger than the many other small components, whose size distribution decreases exponentially (see Figure 7).

FIG. 7: Top: coordination degree distribution of the collection of random graphs generated via the Monte Carlo algorithm, fixing the equilibrium degree distribution equal to the one observed in the protein sequence graph at $\bar w = 0.99$ and fixing the average degree equal to $\langle z \rangle = 0.57$. Bottom: size distribution of the connected components of the random graphs.

By comparison, in the SIMAP protein sequence space the coordination degree distribution $f_{\bar w}(z)$ and the connected component distribution $n_{\bar w}(s)$ are strongly correlated. The former, for example, can be reproduced quite well by means of $n_{\bar w}(s)$. Let the index $i$ label the connected components. If all possible edges between the nodes of a connected component of size $s_i$ were present, the cluster would be a complete subgraph and all its $s_i$ nodes would have coordination degree $z_i = s_i - 1$. If this were true for all connected components, all clusters would be complete subgraphs and we would expect a coordination degree distribution $f_{\bar w}(z) \sim \left( s\, n_{\bar w}(s) \right)\big|_{s = z + 1}$. In our graphs, although complete connected components are present, the majority of clusters only have a high average degree, not equal to the size minus one as in complete graphs. However, consider a component with size $s_i$ and a number of edges equal to $m_i$. The quantity

\[ \Lambda_i = \frac{2\, m_i}{s_i (s_i - 1)} \]

represents the fraction of edges that are present in the $i$-th component with respect to the number of edges that would be present if the component were a complete subgraph, i.e. $s_i (s_i - 1)/2$. Introducing $\Lambda_i$ as a measure of the edge density of each component, we can approximate the coordination degree distribution $f_{\bar w}(z)$ by means of the component size distribution $n_{\bar w}(s)$ too. Specifically, we find that the coordination degree distribution behaves like

\[ f_{\bar w}(z) \sim \Lambda(z + 1)\, (z + 1)\, n_{\bar w}(z + 1) , \]

where $\Lambda(s)$ is the edge density averaged over all components of size $s$:

\[ \Lambda(s) = \frac{1}{n_{\bar w}(s)} \sum_{i :\, s_i = s} \Lambda_i . \]

Figure 8 shows both the observed degree distribution and the approximated degree distribution obtained by means of $n(s)$ for the graph at $\bar w = 0.95$.

FIG. 8: Observed degree distribution (black curve) and approximated degree distribution (red curve) obtained by means of $n(s)$ for the graph at $\bar w = 0.95$.
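
The approximation of the degree distribution from the component statistics can be sketched as follows. The input format, a list of `(size, number of edges)` pairs with one entry per component, and the function names are assumptions made for this illustration; the three toy components are hypothetical.

```python
from collections import defaultdict

def edge_density(size, n_edges):
    """Fraction of the possible edges that are present: 2*m / (s*(s-1))."""
    return 2.0 * n_edges / (size * (size - 1))

def approx_degree_distribution(components):
    """components: iterable of (size, n_edges) pairs with size >= 2.

    Returns a dict z -> Lambda(z+1) * (z+1) * n(z+1), the (not normalized)
    approximation of the number of nodes with coordination degree z."""
    count = defaultdict(int)          # n(s): number of components of size s
    density_sum = defaultdict(float)  # sum of edge densities per size
    for size, n_edges in components:
        count[size] += 1
        density_sum[size] += edge_density(size, n_edges)
    approx = {}
    for size, n in count.items():
        mean_density = density_sum[size] / n  # Lambda(s)
        approx[size - 1] = mean_density * size * n
    return approx

# Hypothetical toy input: a triangle, a 3-node path and a 4-node clique.
toy_components = [(3, 3), (3, 2), (4, 6)]
toy_approx = approx_degree_distribution(toy_components)
```

For complete components the estimate is exact (the 4-node clique contributes exactly four nodes of degree 3); for sparser components it redistributes the nodes according to the average density, which is why it is only an approximation.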



VI. GIANT COMPONENT 

An interesting phenomenon occurs as the value of $\bar w$ decreases: we see the formation of a giant component. Figure 9(a) shows the behaviour of the fraction of nodes belonging to the largest component.

Starting from approximately $\bar w \approx 0.65$, the largest component begins to expand, capturing many smaller components. Furthermore, the components that are disconnected at $\bar w \approx 0.675$ and that merge into the giant component at $\bar w \approx 0.65$ have many different sizes, from small components to very big ones. This phenomenon becomes more and more evident for lower values of $\bar w$, when the coordination degree distribution of the giant component follows a power law scaling. This is also evident from Figure 6(b), where we plot the distribution of the coordination degree for the whole set of proteins. The exponent $\alpha(\bar w)$ of the power law behaviour $f_{\bar w}(z) \sim z^{-\alpha(\bar w)}$ varies slightly between the regions corresponding to small and to large values of the coordination degree $z$. Clearly, when a giant component exists, the region with large $z$ is largely determined by the giant component itself. In Table III we report the fitted values of the exponent $\alpha(\bar w)$ computed in two regions, with small and large values of $z$. As we decrease the value of $\bar w$, the two fitted values of $\alpha(\bar w)$ diverge more and more. In fact, since the largest component is growing, the tail of the distribution $f_{\bar w}(z)$ becomes more and more important and assumes a power law behaviour characterized by a different exponent.

A significant fact accompanies the rapid size increase of the largest component. In Table IV we show, for each $\bar w$, the fraction of the different kingdoms and the number of different species that appear in the largest connected component. Down to around $\bar w = 0.675$, only proteins coming from viruses belong to the largest component and, moreover, this largest cluster has not yet become giant with respect to the smaller clusters. For $\bar w < 0.675$ the formation of a giant component begins and, simultaneously, all kingdoms enter the species composition of the giant cluster. This is also evident from Figure 9(b), where we plot the fraction of species belonging to the largest component: this ratio increases rapidly around the same value of $\bar w$. These processes continue for lower values of $\bar w$, with the giant component including more and more proteins belonging to many different species, and the ratio for each kingdom tends to become the same as that of the whole database. Furthermore, around $\bar w \approx 0.475$ there is a very sharp increase both in the size of the giant component and, especially, in the number of species present in it, as is evident from Figures 9(a) and 9(b).

   w       d        size     bacteria  viruses  plants  invertebrates  vertebrates  number of different species
 0.975   0.1581      8322     0.000     1.000    0.000      0.000         0.000             4
 0.950   0.2236     15955     0.000     1.000    0.000      0.000         0.000             4
 0.925   0.2739     47687     0.000     1.000    0.000      0.000         0.000            10
 0.900   0.3162     50729     0.000     1.000    0.000      0.000         0.000            14
 0.875   0.3536     51028     0.000     1.000    0.000      0.000         0.000            14
 0.850   0.3873     51405     0.000     1.000    0.000      0.000         0.000            14
 0.825   0.4183     51969     0.000     1.000    0.000      0.000         0.000            29
 0.800   0.4472     52097     0.000     1.000    0.000      0.000         0.000            29
 0.775   0.4743     52881     0.000     1.000    0.000      0.000         0.000            29
 0.750   0.5000     63003     0.000     1.000    0.000      0.000         0.000            60
 0.725   0.5244    118777     0.000     1.000    0.000      0.000         0.000            67
 0.700   0.5477    120974     0.000     0.999    0.000      0.000         0.000           106
 0.675   0.5701    145278     0.002     0.997    0.000      0.000         0.000           302
 0.650   0.5916    224310     0.002     0.749    0.001      0.000         0.248           988
 0.625   0.6124    272426     0.014     0.662    0.010      0.007         0.306          4384
 0.600   0.6325    297280     0.028     0.643    0.015      0.011         0.303          7854
 0.575   0.6519    318472     0.032     0.613    0.027      0.015         0.313          9668
 0.550   0.6708    362379     0.047     0.554    0.035      0.024         0.341         11437
 0.525   0.6892    404788     0.049     0.526    0.047      0.029         0.349         15593
 0.500   0.7071    450072     0.065     0.482    0.055      0.033         0.365         16272
 0.475   0.7246    584371     0.084     0.379    0.151      0.037         0.349         20957
 0.450   0.7416    718286     0.114     0.312    0.194      0.041         0.340         35346
 0.425   0.7583    975629     0.151     0.229    0.184      0.095         0.341         68338
 0.400   0.7746   1202753     0.181     0.188    0.209      0.096         0.326         76230
 0.375   0.7906   1435734     0.210     0.160    0.224      0.093         0.312         77970
 0.350   0.8062   1739772     0.254     0.133    0.236      0.087         0.291         80100
 0.325   0.8216   2059217     0.288     0.117    0.239      0.083         0.273         82714
 0.300   0.8367   2383804     0.316     0.102    0.244      0.080         0.258         84953
 0.275   0.8515   2728214     0.350     0.090    0.243      0.078         0.239         86151
 0.250   0.8660   3071192     0.374     0.083    0.240      0.076         0.226         90357
 0.225   0.8803   3420697     0.396     0.078    0.239      0.074         0.213         94210
 0.200   0.8944   3807556     0.416     0.076    0.237      0.073         0.199        101358
 0.175   0.9083   4210208     0.432     0.074    0.234      0.072         0.188        102774
 0.150   0.9220   4651704     0.446     0.072    0.233      0.073         0.177        103831
 0.125   0.9354   5049016     0.455     0.069    0.235      0.073         0.167        104227

TABLE IV: For each fixed value of $\bar w$, the fraction of proteins, among those belonging to the largest component, that come from each of the five kingdoms. The columns d and size give the corresponding distance $\bar d = \sqrt{1 - \bar w}$ and the size of the largest component; the last column gives the number of different species it contains.
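
The kingdom composition of the largest component can be computed directly from a component labelling of the nodes. The sketch below assumes two parallel lists, one with the component label of each protein and one with its kingdom; these names and the seven-protein example are hypothetical.

```python
from collections import Counter

def largest_component_composition(labels, kingdoms):
    """Return (size, fractions) for the largest component.

    labels   : component label of each node
    kingdoms : kingdom annotation of each node (same order as labels)
    fractions: dict kingdom -> fraction of the nodes of the largest component
    """
    sizes = Counter(labels)
    root, size = sizes.most_common(1)[0]
    members = Counter(k for lab, k in zip(labels, kingdoms) if lab == root)
    return size, {k: c / size for k, c in members.items()}

# Hypothetical toy labelling of 7 proteins in 3 components.
toy_labels = [0, 0, 0, 0, 1, 1, 2]
toy_kingdoms = ["viruses", "viruses", "viruses", "vertebrates",
                "bacteria", "plants", "plants"]
toy_size, toy_fractions = largest_component_composition(toy_labels, toy_kingdoms)
```

Counting distinct species instead of kingdoms would follow the same pattern, with a set of species identifiers collected over the members of the largest component.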



The processes just described may indicate the presence of a phase transition: we have two different phases, one for large values of $\bar w$, characterized by the presence of clusters with similar sizes and with the largest one composed mainly of viruses, and a second phase characterized by the presence of a giant component, composed of different species, alongside other small clusters. We note, however, that the phase transition is not sharp: the changes in the size and composition of the largest component are spread over the range
0.475 < w < 0.675. We also note that the plot in Fig- 
ure^ has a very rapid increase for w ~ 0.475. 
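
The growth of the largest component as the threshold is lowered can be followed in a single pass by adding the edges in order of decreasing weight to a union-find structure that tracks component sizes. This is a sketch under assumptions (hypothetical function name, toy edge list), not the procedure actually used on the full network.

```python
def largest_component_sweep(n_nodes, edges, thresholds):
    """For each threshold return the fraction of nodes in the largest
    component of the graph that keeps the edges with weight > threshold.
    Thresholds are processed from the highest to the lowest."""
    parent = list(range(n_nodes))
    size = [1] * n_nodes
    largest = 1

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    ordered = sorted(edges, key=lambda e: e[2], reverse=True)
    fractions = {}
    i = 0
    for t in sorted(thresholds, reverse=True):
        # Add every edge that becomes admissible at this threshold.
        while i < len(ordered) and ordered[i][2] > t:
            a, b, _ = ordered[i]
            ra, rb = find(a), find(b)
            if ra != rb:
                if size[ra] < size[rb]:
                    ra, rb = rb, ra
                parent[rb] = ra          # union by size
                size[ra] += size[rb]
                largest = max(largest, size[ra])
            i += 1
        fractions[t] = largest / n_nodes
    return fractions

# Hypothetical toy graph with 6 nodes.
toy_edges = [(0, 1, 0.9), (1, 2, 0.6), (3, 4, 0.3), (2, 3, 0.25)]
toy_fractions = largest_component_sweep(6, toy_edges, [0.8, 0.5, 0.2])
```

A sudden jump of this fraction between two neighbouring thresholds is the kind of signature one would look for when locating the onset of a giant component.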

In Table V, for each $\bar w$, it can be seen how the different
kingdoms are distributed in connected components. In 
particular we count the number of components, whose 
size is greater than 90 and record the percentage of 
clusters whose proteins come from species of only one 
kingdom, only from a pair of kingdoms, etc., up to the 
percentage of connected components which contain pro- 

w              0.95  0.90  0.85  0.80  0.75  0.70  0.65  0.60  0.55  0.50  0.45  0.35  0.25  0.15
bacteria        9.6  12.2  14.2  17.2  21.9  22.6  23.6  23.8  23.9  25.1  25.8  29.0  35.6  57.0
viruses        32.7  31.4  24.3  17.6  11.4   7.4   5.2   3.8   2.9   2.7   2.4   2.7   4.2   7.5
plants          9.3  10.8  11.4   9.4   8.3   7.3   7.6   7.8   7.7   7.5   7.5   6.2   4.0   0.0
invertebrates  11.6   8.9   7.4   5.8   3.6   3.2   2.5   2.0   1.6   1.5   1.2   1.4   1.3   1.1
vertebrates    22.9  23.0  25.4  25.7  25.6  25.9  23.6  20.0  17.1  13.0  10.2   5.2   2.8   1.1
bac-vir         2.7   2.2   2.1   2.1   1.6   1.6   1.4   1.0   1.0   1.1   1.0   1.7   2.4   3.2
bac-pla         1.6   1.8   2.8   2.9   3.5   4.5   5.9   7.0   8.5   8.9   9.1  10.8  11.3  18.3
bac-inv         0.5   0.4   0.7   0.7   0.8   0.9   1.3   1.7   2.1   2.1   2.0   2.6   3.0   1.1
bac-ver         1.8   2.0   2.4   2.3   1.9   1.9   1.8   1.6   1.5   1.5   1.3   1.1   1.1   1.1
vir-pla         0.2   0.1   0.2   0.4   0.3   0.4   0.3   0.3   0.2   0.2   0.2   0.2   0.5   0.0
vir-inv         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
vir-ver         0.2   0.5   0.7   0.8   0.9   0.7   0.6   0.4   0.3   0.2   0.1   0.2   0.1   0.0
pla-inv         0.9   0.0   0.0   0.0   0.0   0.2   0.1   0.1   0.1   0.2   0.3   0.2   0.5   0.0
pla-ver         0.5   0.9   0.8   1.1   1.3   1.0   1.1   1.2   1.2   1.0   0.9   1.3   1.7   1.1
inv-ver         0.5   1.1   2.6   4.5   7.0   8.4   9.2  10.3  10.9  11.2  11.0   9.0   5.5   0.0
bac-vir-pla     0.0   0.4   0.3   0.5   0.3   0.3   0.4   0.2   0.2   0.2   0.4   0.4   0.7   1.1
bac-vir-inv     0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.1   0.0   0.1   0.1   0.3   1.1
bac-vir-ver     0.2   0.1   0.0   0.0   0.0   0.1   0.1   0.1   0.1   0.1   0.2   0.2   0.2   0.0
bac-pla-inv     0.0   0.0   0.1   0.2   0.5   0.6   0.8   0.9   1.3   2.0   2.3   2.4   3.1   1.1
bac-pla-ver     0.0   0.1   0.0   0.0   0.1   0.3   0.6   0.6   0.9   1.0   1.3   1.7   1.4   0.0
bac-inv-ver     0.0   0.0   0.1   0.3   0.4   0.4   0.4   0.9   0.8   0.9   0.9   1.0   0.8   1.1
vir-pla-inv     0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
vir-pla-ver     0.0   0.0   0.0   0.0   0.1   0.1   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.0


vir-inv-ver 


0.0 


0.0 


0.1 


0.2 


0.1 


0.2 


0.3 


0.2 


0.2 


0.2 


0.2 


0.1 


0.1 


0.0 


pla-inv-ver 


0.9 


1.4 


1.8 


5.5 


7.3 


8.4 


9.4 


11.0 


11.3 


12.0 


12.4 


13.4 


11.7 


0.0 


bac-vir-pla-inv 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.1 


0.1 


0.1 


0.1 


0.2 


0.0 


0.0 


bac-vir-pla-ver 


0.0 


0.0 


0.1 


0.1 


0.1 


0.1 


0.2 


0.1 


0.1 


0.1 


0.1 


0.1 


0.1 


0.0 


bac-vir-inv-ver 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.1 


0.1 


0.0 


bac-pla-inv-ver 


0.2 


0.1 


0.4 


0.7 


1.0 


2.1 


2.5 


3.8 


5.1 


6.4 


8.0 


7.6 


6.7 


0.0 


vir-pla-inv-ver 


0.0 


0.1 


0.0 


0.0 


0.1 


0.1 


0.2 


0.3 


0.3 


0.3 


0.2 


0.2 


0.1 


1.1 


bac-vir-pla-inv-ver 


0.0 


0.0 


0.1 


0.1 


0.1 


0.2 


0.2 


0.1 


0.2 


0.3 


0.5 


0.7 


0.4 


1.1 



TABLE V: Spread of species in connected components. Each value indicates the percentage of clusters, calculated on clusters having 
size greater than 90, composed by proteins coming from only one kingdom, only from a pair of kingdoms, etc., up to the percentage 
of clusters composed by proteins of all kingdoms. 



teins of all kingdoms. For high values of w the majority 
of clusters are made up of proteins belonging to only one 
kingdom, in particular the kingdom of viruses; clusters 
with proteins of different kingdoms are very scarce. As 
expected, as w decreases, the percentage of clusters be- 
longing to only one kingdom decreases in favor of clus- 
ters of mixed kingdom composition. 
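
The counting behind Table V can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names, the representation of a cluster as a list of kingdom labels (one per protein) and the toy clusters are assumptions made here. Single-kingdom clusters are labelled with the abbreviation (for example "bac") rather than the full kingdom name used in the table.

```python
from collections import Counter

# Kingdom abbreviations, in the order used for the row labels of Table V.
ORDER = ["bac", "vir", "pla", "inv", "ver"]


def combination_label(kingdoms):
    """Row label for a cluster, e.g. 'bac-pla' for bacteria plus plants."""
    present = set(kingdoms)
    return "-".join(k for k in ORDER if k in present)


def kingdom_spread(clusters, min_size=90):
    """Percentage of clusters per kingdom combination, computed only over
    clusters with more than min_size proteins."""
    big = [c for c in clusters if len(c) > min_size]
    if not big:
        return {}
    counts = Counter(combination_label(c) for c in big)
    return {label: 100.0 * n / len(big) for label, n in counts.items()}


# Hypothetical toy clusters; min_size=2 is used only to keep the example small.
toy_clusters = [
    ["bac", "bac", "bac"],
    ["bac", "pla", "bac"],
    ["vir", "vir", "vir"],
    ["vir", "vir"],                 # not larger than min_size, ignored
    ["bac", "bac", "bac", "bac"],
]
print(kingdom_spread(toy_clusters, min_size=2))
```

Running the same function on the clusters obtained at each value of w would give one column of the table per threshold.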

It is interesting to note that the virus kingdom has a
very low tendency to cluster with the other kingdoms, in
particular with plants and animals. Furthermore, for no
value of w do we see the formation of components (of
size greater than 90) with proteins coming from viruses
and invertebrates, or from viruses, plants and invertebrates.
Virus proteins cluster mainly with bacterial proteins.
In addition, we observe that bacterial proteins cluster
mainly with plant proteins and vice versa. Moreover,
although plant proteins cluster infrequently with
invertebrates and with vertebrates, there are many more
clusters consisting simultaneously of plant, invertebrate and
vertebrate proteins. Finally, we note that at the lowest
value of w the majority of components that are not included
in the giant component are clusters consisting of
bacterial proteins, of bacterial and plant proteins, and of
virus proteins.



VII. ANALYSIS OF THE PROTEINS THAT CONNECT 
CLUSTERS 

Protein pairs that connect clusters in the different
weight intervals are of special interest, as they harbor
the most conserved sequence regions that are shared by
the interconnected clusters. We want to know whether
certain sequence features and protein domains are enriched
in these proteins compared to the complete proteome.
We have therefore calculated several sequence features for
all proteins contained in SIMAP: length, isoelectric point
(using the EMBOSS sequence analysis package [6]), low
complexity content (using the program seg [7]) and the
number of predicted transmembrane segments (using the
program TMHMM [8]). Additionally, in order to derive
functional information for all proteins, we have predicted
signal peptides (using SignalP 3.0 [9]), localization
signals (using TargetP 1.1 [10]) and protein domains (using
the databases PFAM, TIGRFAM, PANTHER, SUPERFAMILY,
SMART and PIRSF from InterPro 12.1 [11]) for
all SIMAP proteins.

[Figure 9, panel title: Behaviour of giant component]

FIG. 9: (a) Fraction of nodes belonging to the largest cluster for
each value of w. (b) Fraction of species present in the largest
cluster for each value of w.

[Figure 10(a), panel title: Length of proteins joining small clusters]

FIG. 10: Length representation of (a) proteins joining generic
clusters and of (b) proteins joining the largest cluster. The red
color encodes overrepresented lengths; the blue color indicates
underrepresented lengths.

For all weight intervals we have counted the feature
occurrences in the proteins that connect clusters; these
proteins are all pairs of sequences which belong to different
clusters in the graph built at w_1 and to the same cluster
in the graph built at w_2, where w_2 < w_1
are two consecutive values of the weight w. We have also
distinguished between two disjoint sets of these proteins:
proteins linking the clusters that will form the largest
cluster in the graph built at w_2, and proteins linking the
other generic clusters.
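
A possible way to extract these pairs is sketched below in Python. It is an illustration under explicit assumptions rather than the procedure actually used: it considers only directly linked pairs, i.e. edges whose weight lies in the interval between w_2 and w_1 and whose endpoints are in different clusters at w_1, and the function names and the toy edge list are hypothetical.

```python
from collections import Counter


def components(n_nodes, edges, w_min):
    """Cluster label of every node when only edges with w > w_min are kept."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, w in edges:
        if w > w_min:
            parent[find(a)] = find(b)
    return [find(x) for x in range(n_nodes)]


def connecting_pairs(n_nodes, edges, w1, w2):
    """Linked pairs that merge clusters between w1 and w2 (w2 < w1), split
    into those ending up in the largest cluster at w2 and the others."""
    at_w1 = components(n_nodes, edges, w1)
    at_w2 = components(n_nodes, edges, w2)
    largest = Counter(at_w2).most_common(1)[0][0]
    to_largest, generic = [], []
    for a, b, w in edges:
        # Edge appears in this interval and joins two distinct w1-clusters.
        if w2 < w <= w1 and at_w1[a] != at_w1[b]:
            if at_w2[a] == largest:
                to_largest.append((a, b))
            else:
                generic.append((a, b))
    return to_largest, generic


# Hypothetical toy network: (node, node, weight).
toy_links = [(0, 1, 0.9), (1, 2, 0.9), (3, 4, 0.9), (2, 3, 0.6), (5, 6, 0.6)]
print(connecting_pairs(7, toy_links, w1=0.7, w2=0.5))
```

In the toy network the pair (2, 3) joins two clusters into what becomes the largest cluster at w_2, while the pair (5, 6) joins two generic clusters.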

The enrichment (e) of features was calculated as the ratio
of the number of features found (k) to the number of
features expected (k_s): e = k/k_s. The number of features
expected was calculated by k_s = Kn/V, where
n is the number of proteins of interest (e.g. connecting
clusters in a given weight interval), K denotes the number
of proteins used for clustering having the given feature
and V corresponds to the number of proteins used
for clustering.
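
As a worked example of this definition, the short Python sketch below evaluates e = k/k_s with k_s = Kn/V. The function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
def enrichment(k, n, K, V):
    """Enrichment e = k / k_s, with expected count k_s = K * n / V.

    k: proteins of interest that carry the feature (found)
    n: proteins of interest (e.g. connecting clusters in one weight interval)
    K: proteins used for clustering that carry the feature
    V: all proteins used for clustering
    """
    if n <= 0 or K <= 0 or V <= 0:
        raise ValueError("n, K and V must be positive")
    k_s = K * n / V
    return k / k_s


# Hypothetical numbers: 100 of 1000 proteins carry the feature, so among
# 50 proteins of interest we expect 5; finding 20 gives a fourfold enrichment.
print(enrichment(k=20, n=50, K=100, V=1000))
```

Values of e above 1 correspond to over-representation and values below 1 to under-representation.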



A. Results 

Proteins joining clusters outside the largest cluster
show an over-representation of lengths around 400 aa
(Figure 10(a)), an over-representation of proteins with
small low complexity content (Figure 11(a)), are often
neutral or weakly acidic (Figure 12(a)) and contain more
transmembrane proteins than expected (Figure 13(a)).
Proteins joining clusters in the giant component are
characterized by short and very long lengths (Figure 10(b)),
reduced low complexity content (Figure 11(b)), acidic
and alkaline isoelectric points, depending on the weight
interval (Figure 12(b)), and a high number of transmembrane
domains in the lower weight intervals (Figure 13(b)).
Signal peptides were found to be overrepresented in proteins
joining clusters outside the largest component at the
lower weight intervals; at higher weight intervals, and
in proteins joining clusters in the largest component, they
were found to be underrepresented, as were localization
signals in all proteins joining clusters (Figure 14(a) and
Figure 14(b)). For all considered weight intervals we
could find interval-specific overrepresented and
underrepresented protein domains (Figures 15(a) and 15(b)).
Remarkably, these domains are not only specific to a certain
weight interval, but also differ between proteins joining
clusters outside the largest component and proteins
joining clusters in the largest component (see Table VI).

[Figure 11(a), panel title: Low complexity content of proteins joining small clusters]

FIG. 11: Representation of the low complexity content of (a)
proteins joining generic clusters and of (b) proteins joining the
largest cluster. The red color encodes overrepresented values;
the blue color indicates underrepresented values.

FIG. 12: Representation of the isoelectric points of (a) proteins
joining generic clusters and of (b) proteins joining the largest
cluster. The red color encodes overrepresented values; the blue
color indicates underrepresented values.



B. Discussion

All of the analyzed sequence features indicate that
proteins that join clusters at a certain weight interval
are not distributed equally over the complete protein
space. For all of the features we could find specific
under- and over-representation. Proteins joining clusters
outside the largest component and proteins joining clusters
in the largest component are different with respect
to almost all considered features, which indicates that
the largest component contains proteins that are different
from those contained in other large clusters. These
findings are complemented by the observation of specific
over- and underrepresented functional domains in the
proteins connecting clusters at certain weight intervals.
Thus we conclude that for each weight interval a small
number of protein families is responsible for cluster
interconnections.



VIII. CONCLUSIONS


We investigated the local and global properties of the
sequence similarity space formed by all proteins in the
SIMAP database, which contains more than 5.5 million
amino acid sequences. We represented this space as
a graph whose vertices are proteins and whose edges are
weighted to reflect the similarity between the corresponding
pairs of sequences (high weight, high similarity).
The choice of the weight formula (1) came from the
necessity to compare the similarity scores of pairs
of sequences that could have different lengths. The SW
score was therefore modified by means of the geometric
mean of the self-scores, which contains the length
information of the two aligned sequences.

FIG. 13: Representation of the predicted number of transmembrane
helices of (a) proteins joining generic clusters and of (b)
proteins joining the largest cluster. The red color encodes
overrepresented values; the blue color indicates underrepresented
values.

[Figure 14(a), panel title: Signal peptides and localization predictions in proteins joining small clusters]

FIG. 14: Representation of the predicted signal peptides and
protein localization signals of (a) proteins joining generic
clusters and of (b) proteins joining the largest cluster. The red
color encodes overrepresented values; the blue color indicates
underrepresented values.

Then, keeping only edges with a weight greater than a
minimal value of w, we built a collection of graphs by
varying this minimal value. From the analysis of the
connected components we found that these graphs do
not belong to the class of random graphs; rather, they
are characterized by a power law behaviour both in the
cluster size distribution and in the coordination degree
distribution, and for each fixed w these two distributions
are strongly related to each other.

By varying w, we found interesting changes
in the global organization of the protein homology
networks: we observed two different phases, one for large
values of w, characterized by the presence of clusters
of similar size, each composed essentially of
proteins belonging to only one kingdom and with the
largest one composed mainly of viral proteins, and a
second phase, for lower values of w, characterized by the
presence of a giant component composed of different
species together with other very small clusters.

Finally, we investigated sequence features and functional
information of the protein pairs that are responsible
for the connection of clusters in the different intervals of
w, since they harbor the most conserved sequence regions
that are shared by the interconnected clusters. We
found that proteins joining clusters outside the largest
component and proteins joining clusters in the largest
component are different with respect to almost all
considered features, which indicates that the largest
component contains proteins that are different from those
contained in other large clusters. Indeed, we found an
over-representation of a small set of domains, which shows
that a small number of protein families is responsible for
cluster interconnections.

The analysis we performed gives a first view of the
global organization of the largest protein homology
network built so far. It is a first step and a
starting point for answering other interesting global or
local questions, which could confirm that the protein
homology network is structured with respect to functional
and evolutionary properties.

[Figure 15, panel titles: (a) Protein domain signatures in proteins joining small clusters; (b) Protein domain signatures in proteins joining the giant component]

FIG. 15: Representation of the predicted protein domains of (a)
proteins joining generic clusters and of (b) proteins joining the
largest cluster. Each line in the graph denotes a certain domain. The
red color encodes overrepresented values; the blue color indicates
underrepresented values.

w_1 -> w_2      Proteins joining generic clusters  Proteins joining the largest cluster
                      e  PFAM domain                     e  PFAM domain

0.750 -> 0.725     0.02  PF00598 Flu_M1               0.93  PF00078 RVT_1
                   0.03  PF00522 VPR                  1.08  PF00075 RnaseH
                   0.03  PF00540 Gag_p17              1.44  PF06815 RVT_connect
                   0.03  PF00951 Arteri_Gl            1.46  PF07075 DUF1343
                   0.03  PF00971 EIAV_GP90            2.19  PF00665 rve

                   9.40  PF02916 DNA_PPF             15.41  PF00607 Gag_p24
                  11.09  PF07095 IgaA                18.79  PF00517 GP41
                  11.25  PF08272 Topo_Zn_Ribbon      18.91  PF02022 Integrase_Zn
                  11.83  PF06899 WzyE                27.07  PF00540 Gag_p17
                  12.46  PF06788 UPF0257            137.49  PF00516 GP120

0.725 -> 0.700           (no entries)                 0.88  PF00078 RVT_1
                                                      1.16  PF00077 RVP
                                                      1.91  PF06817 RVT_thumb
                                                      3.68  PF00075 RnaseH
                                                      3.77  PF00665 rve

                                                     37.19  PF00186 DHFR_1
                                                     80.26  PF00098 zf-CCHC
                                                    129.77  PF00516 GP120
                                                    139.92  PF00607 Gag_p24
                                                    145.50  PF00540 Gag_p17

0.700 -> 0.675     0.01  PF00516 GP120                0.12  PF00098 zf-CCHC
                   0.01  PF00522 VPR                  0.15  PF00271 Helicase_C
                   0.01  PF00602 Flu_PB1              0.22  PF00078 RVT_1
                   0.01  PF00603 Flu_PA               1.02  PF01560 HCV_NS1
                   0.01  PF01539 HCV_env              1.16  PF06817 RVT_thumb

                  10.14  PF08435 Calici_coat_C       15.62  PF02907 Peptidase_S29
                  10.22  PF03296 Pox_polyA_pol       19.47  PF00517 GP41
                  12.94  PF05733 Tenui_N             57.66  PF00516 GP120
                  12.98  PF03805 CLAG                74.03  PF00077 RVP
                  13.68  PF00897 Orbi_VP7            98.38  PF02348 CTP_transf_3

0.675 -> 0.650     0.01  PF00064 Neur                 0.10  PF00078 RVT_1
                   0.01  PF00469 F-protein            0.13  PF00077 RVP
                   0.01  PF00506 Flu_NP               0.18  PF00560 LRR_1
                   0.01  PF00516 GP120                0.18  PF00607 Gag_p24
                   0.01  PF00540 Gag_p17              0.30  PF00665 rve

                  11.63  PF04310 MukB               151.92  PF02959 Tax
                  12.71  PF07108 PipA               168.64  PF00758 EPO_TPO
                  13.48  PF07429 Fuc4NAc_transf     431.37  PF08300 HCV_NS5a_1a
                  15.20  PF03506 Flu_C_NS1          441.03  PF08301 HCV_NS5a_1b
                  15.26  PF06593 RBDV_coat          483.96  PF01506 HCV_NS5a

0.650 -> 0.625     0.01  PF00506 Flu_NP               0.03  PF00096 zf-C2H2
                   0.01  PF00516 GP120                0.04  PF00078 RVT_1
                   0.01  PF00540 Gag_p17              0.17  PF00023 Ank
                   0.01  PF00603 Flu_PA               0.17  PF00589 Phage_integrase
                   0.01  PF00695 vMSA                 0.19  PF00903 Glyoxalase

                  12.57  PF06952 PsiA               202.08  PF01002 Flavi_NS2B
                  13.73  PF06788 UPF0257            221.93  PF01349 Flavi_NS4B
                  14.79  PF05788 Orbi_VP1           222.59  PF01353 GFP
                  15.42  PF00901 Orbi_VP5           229.23  PF01350 Flavi_NS4A
                  16.02  PF03753 HHV6-IE            243.38  PF00948 Flavi_NS1

0.625 -> 0.600     0.01  PF00124 Photo_RC             0.09  PF00009 GTP_EFTU
                   0.01  PF00603 Flu_PA               0.13  PF07974 EGF_2
                   0.01  PF00695 vMSA                 0.20  PF00096 zf-C2H2
                   0.01  PF01560 HCV_NS1              0.22  PF00560 LRR_1
                   0.02  PF00223 PsaA_PsaB            0.23  PF01546 Peptidase_M20

                  11.95  PF06517 Orthopox_A43R      376.41  PF01002 Flavi_NS2B
                  12.09  PF00843 Arena_nucleocap    403.70  PF00948 Flavi_NS1
                  13.08  PF06802 DUF1231            411.72  PF01349 Flavi_NS4B
                  14.72  PF05273 Pox_RNA_Pol_22     425.27  PF01350 Flavi_NS4A
                  16.90  PF03021 CM2                538.21  PF05408 Peptidase_C28

0.600 -> 0.575     0.01  PF00517 GP41                 0.06  PF00096 zf-C2H2
                   0.01  PF00559 Vif                  0.06  PF00097 zf-C3HC4
                   0.01  PF00600 Flu_NS1              0.09  PF00009 GTP_EFTU
                   0.01  PF00969 MHC_II_beta          0.09  PF01266 DAO
                   0.01  PF06815 RVT_connect          0.11  PF01926 MMR_HSR1

                  10.54  PF02477 Nairo_nucleo       133.87  PF05790 C2-set
                  11.95  PF07982 Herpes_UL74        139.12  PF01353 GFP
                  12.30  PF06871 TraH_2             150.11  PF00518 E6
                  14.14  PF02509 Rota_NS35          195.29  PF02929 Bgal_small_N
                  16.04  PF06929 Rotavirus_VP3      231.71  PF01382 Avidin

0.575 -> 0.550     0.01  PF00016 RuBisCO_large        0.02  PF00115 COX1
                   0.01  PF00113 Enolase_C            0.07  PF07690 MFS_1
                   0.01  PF00123 Hormone_2            0.08  PF07993 NAD_binding_4
                   0.01  PF00506 Flu_NP               0.09  PF00517 GP41
                   0.01  PF01010 Oxidored_q1_C        0.10  PF00583 Acetyltransf_1

                  10.60  PF06134 RhaA               161.43  PF01140 Gag_MA
                  10.95  PF07095 IgaA               168.19  PF04528 Adeno_E4_34
                  11.75  PF00897 Orbi_VP7           173.44  PF08377 MAP2_projctn
                  12.13  PF03294 Pox_Rap94          184.23  PF02093 Gag_p30
                  13.75  PF01295 Adenylate_cycl     311.32  PF01141 Gag_p12

0.550 -> 0.525     0.01  PF00016 RuBisCO_large        0.06  PF00067 p450
                   0.01  PF00516 GP120                0.07  PF00023 Ank
                   0.01  PF00522 VPR                  0.08  PF00097 zf-C3HC4
                   0.01  PF00540 Gag_p17              0.11  PF01381 HTH_3
                   0.01  PF01539 HCV_env              0.11  PF04851 ResIII

                  11.29  PF05928 Zea_mays_MuDR      101.41  PF01537 Herpes_glycop_D
                  11.62  PF06829 DUF1238            121.18  PF02929 Bgal_small_N
                  11.63  PF03277 Herpes_UL4         123.25  PF01376 Enterotoxin_b
                  11.64  PF03395 Pox_P4A            128.24  PF06466 PCAF_N
                  12.73  PF08405 Calici_PP_N        147.36  PF05806 Noggin

0.525 -> 0.500     0.01  PF00600 Flu_NS1              0.02  PF00106 adh_short
                   0.01  PF00869 Flavi_glycoprot      0.04  PF00270 DEAD
                   0.01  PF01539 HCV_env              0.05  PF00037 Fer4
                   0.01  PF02461 AMO                  0.06  PF02518 HATPase_c
                   0.01  PF02788 RuBisCO_large_N      0.08  PF00249 Myb_DNA-binding

                  11.36  PF07434 CblD                68.92  PF03939 Ribosomal_L23eN
                  11.80  PF04913 Baculo_Y142         72.11  PF06267 DUF1028
                  11.98  PF05880 Fiji_64_capsid      96.66  PF02022 Integrase_Zn
                  13.48  PF06306 CgtA               120.34  PF00552 Integrase
                  13.98  PF03317 ELF                129.98  PF02929 Bgal_small_N

TABLE VI: For proteins joining clusters outside the largest component or joining the giant component, the five most
underrepresented and five most overrepresented PFAM domains are given per interval of weight w.



[14] These sequences come from the following databases: PDB proteins,
MIPS non-redundant protein database, UNIPROT SWISSPROT,
UNIPROT-TrEMBL, PFAM sequences, Eukaryotic signature proteins.

[15] The values for the background distribution of amino acids
come from data used for the PAM matrix: p_A = 0.096; p_R = 0.034;
p_N = 0.042; p_D = 0.053; p_C = 0.025; p_Q = 0.032; p_E = 0.053;
p_G = 0.090; p_H = 0.034; p_I = 0.035; p_L = 0.084; p_K = 0.085;
p_M = 0.012; p_F = 0.045; p_P = 0.041; p_S = 0.057; p_T = 0.062;
p_W = 0.012; p_Y = 0.030; p_V = 0.078.
They can be obtained from
http://apps.bioneq.qc.ca/twiki/pub/Knowledgebase/PAM/PAM2.JPG



